50 research outputs found

    Current challenges in simulations of HPC systems

    Full text link
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Simulation of many-core HPC systems is nowadays an active and fruitful area of research. Recent and future proposals are driven by the need of a fast, efficient, and comprehensive simulation framework. This simulation framework should be complete in several ways. First, it should model a wide range of components and provide the mechanisms necessary to plug-in more components as needed. Second, it should allow the designer to focus on critical components while avoiding a large part of the simulation complexity. Each of these components should be able to be evaluated with multiple models with distinct detail levels, ranging from simply analytical models to detailed cycle-accurate simulations. Third, a complete simulation framework should provide a wide range of metrics of interest for the designer and the market. Finally, support for heterogeneous architectures combining CPU and GPU, as well as some degree of reconfigurability is surely required. Building such titanic framework is and will be a collaborative process between researchers around the globe and it is expected to be a hot research topic for the next years.This work has been supported by the Spanish Ministerio de Economía y Competitividad (MINECO), by FEDER funds through Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.Petit Martí, SV. (2015). Current challenges in simulations of HPC systems. IEEE Catalog Number: CFP1578H-CDR. https://doi.org/10.1109/HPCSim.2015.7237110

    Efficient Home-Based protocols for reducing asynchronous communication in shared virtual memory systems

    Full text link
    En la presente tesis se realiza una evaluación exhaustiva de ls Sistemas de Memoria Distribuida conocidos como Sistemas de Memoria Virtual Compartida. Este tipo de sistemas posee características que los hacen especialmente atractivos, como son su relativo bajo costo, alta portabilidad y paradigma de progración de memoria compartida. La evaluación consta de dos partes. En la primera se detallan las bases de diseño y el estado del arte de la investigación sobre este tipo de sistemas. En la segunda, se estudia el comportamiento de un conjunto representativo de cargas paralelas respecto a tres ejes de caracterización estrechamente relacionados con las prestaciones en estos sistemas. Mientras que la primera parte apunta la hipótesis de que la comunicación asíncrona es una de las principales causas de pérdida de prestaciones en los Sistemas de Memoria Virtual Compartida, la segunda no sólo la confirma, sino que ofrece un detallado análisis de las cargas del que se obteiene información sobre la potencial comunicación asíncrona atendiendo a diferentes parámetros del sistema. El resultado de la evaluación se utiliza para proponer dos nuevos protocolos para el funcionamiento de estos sistemas que utiliza un mínimo de recursos de hardware, alcanzando prestaciones similares e incluso superiores en algunos casos a sistemas que utilizan circuitos hardware de propósito específico para reducir la comunicación asíncrona. En particular, uno de los protocolos propuestos es comparado con una reconocida técnica hardware para reducir la comunicación asíncrona, obteniendo resultados satisfactorios y complementarios a la técnica comparada. Todos los modelos y técnicas usados en este trabajo han sido implementados y evalados utilizando un nuevo entorno de simulación desarollado en el contexto de este trabajo.Petit Martí, SV. (2003). Efficient Home-Based protocols for reducing asynchronous communication in shared virtual memory systems [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2908Palanci

    Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors

    Get PDF
    © 2020 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] Resource sharing is a critical issue in simultaneous multithreading (SMT) processors as threads running simultaneously on an SMT core compete for shared resources. Symbiotic job scheduling, which co-schedules applications with complementary resource demands, is an effective solution to maximize hardware utilization and improve overall system performance. However, symbiotic job scheduling typically distributes threads evenly among cores, i.e., all cores get assigned the same number of threads, which we find to lead to sub-optimal performance. In this paper, we show that asymmetric schedules (i.e., schedules that assign a different number of threads to each SMT core) can significantly improve performance compared to symmetric schedules. To leverage this finding, we propose thread isolation, a technique that turns symmetric schedules into asymmetric ones yielding higher overall system performance. Thread isolation identifies SMT-adverse applications and schedules them in isolation on a dedicated core to mitigate their sharp performance degradation under SMT. Our experimental results on an IBM POWER8 processor show that thread isolation improves system throughput by up to 5.5 percent compared to a state-of-the-art symmetric symbiotic job scheduler.Josue Feliu has been partially supported through a postdoctoral fellowship by the Generalitat Valenciana (APOSTD/2017/052). Additional support has been provided by the Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, as well as, by the Universitat Politenica de Valencia through the "Ayudas a Primeros Proyectos de Investigacion" (PAID-06-18) under grant SP20180140. Lieven Eeckhout's research program is supported through FWO grants no. G.0434.16N and G.0144.17N, and the European Research Council (ERC) Advanced Grant agreement no. 741097.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Eeckhout, L. (2020). Thread Isolation to Improve Symbiotic Scheduling on SMT Multicore Processors. IEEE Transactions on Parallel and Distributed Systems. 31(2):359-373. https://doi.org/10.1109/TPDS.2019.2934955S35937331

    Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance

    Full text link
    "© 2019 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works."[EN] To support the massive amount of memory accesses that GPGPU applications generate, GPU memory hierarchies are becoming more and more complex, and the Last Level Cache (LLC) size considerably increases each GPU generation. This paper shows that counter-intuitively, enlarging the LLC brings marginal performance gains in most applications. In other words, increasing the LLC size does not scale neither in performance nor energy consumption. We examine how LLC misses are managed in typical GPUs, and we find that in most cases the way LLC misses are managed are precisely the main performance limiter. This paper proposes a novel approach that addresses this shortcoming by leveraging a tiny additional Fetch and Replacement Cache-like structure (FRC) that stores control and coherence information of the incoming blocks until they are fetched from main memory. Then, the fetched blocks are swapped with the victim blocks (i.e., selected to be replaced) in the LLC, and the eviction of such victim blocks is performed from the FRC. This approach improves performance due to three main reasons: i) the lifetime of blocks being replaced is enlarged, ii) the main memory path is unclogged on long bursts of LLC misses, and iii) the average LLC miss latency is reduced. The proposal improves the LLC hit ratio, memory-level parallelism, and reduces the miss latency compared to much larger conventional caches. Moreover, this is achieved with reduced energy consumption and with much less area requirements. Experimental results show that the proposed FRC cache scales in performance with the number of GPU compute units and the LLC size, since, depending on the FRC size, performance improves ranging from 30 to 67 percent for a modern baseline GPU card, and from 32 to 118 percent for a larger GPU. In addition, energy consumption is reduced on average from 49 to 57 percent for the larger GPU. These benefits come with a small area increase (by 7.3 percent) over the LLC baseline.This work has been supported by the Spanish Ministerio de Ciencia, Innovacion y Universidades and the European ERDF under Grants T-PARCCA (RTI2018-098156-B-C51), and TIN2016-76635-C2-1-R (AEI/ERDF, EU), by the Universitat Politecnica de Valencia under Grant SP20190169, and by the gaZ: T58_17R research group (Aragon Gov. and European ESF).Candel-Margaix, F.; Valero Bresó, A.; Petit Martí, SV.; Sahuquillo Borrás, J. (2019). Efficient Management of Cache Accesses to Boost GPGPU Memory Subsystem Performance. IEEE Transactions on Computers. 68(10):1442-1454. https://doi.org/10.1109/TC.2019.2907591S14421454681

    Segment Switching: A New Switching Strategy for Optical HPC Networks

    Full text link
    [EN] Photonics are becoming realistic technologies for implementing interconnection networks in near future Exascale supercomputer systems. Photonics present key features to design high-performance and scalable supercomputer networks, such as higher bandwidth and lower latencies than their electronic supercomputer networks counterparts. Some research work is focused on conventional network topologies built with photonic technologies, with the aim of taking advantage of photonic characteristics. Nevertheless, these approaches fail in that they keep low the network utilization. We looked into this downside and we found that circuit switching was the main performance limitation. In this article we propose a new switching mechanism, called Segment Switching, to address this constraint and improve the network utilization. Segment Switching splits the circuit in segments of the whole path, and uses buffering on selected nodes on the network. Experimental results show that the devised approach signicantly outperforms photonic circuit switching in conventional torus and fat tree networks by 70% and 90%, respectively.This work was supported in part by the Ministerio de Ciencia, Innovacion y Universidades and in part by the European ERDF under Grant RTI2018-098156-B-C51.Duro, J.; Petit Martí, SV.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2021). Segment Switching: A New Switching Strategy for Optical HPC Networks. IEEE Access. 9:43095-43106. https://doi.org/10.1109/ACCESS.2021.3058135S4309543106

    Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches

    Full text link
    © Owner/Author 2013. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ICS '13 Proceedings of the 27th international ACM conference on International conference on supercomputing; http://dx.doi.org/10.1145/2464996.2467278.This work introduces a novel refresh mechanism that leverages reuse information to decide which blocks should be refreshed in an energy-aware eDRAM last-level cache. Experimental results show that, compared to a conventional eDRAM cache, the energy-aware approach achieves refresh energy savings up to 71%, while the reduction on the overall dynamic energy is by 65% with negligible performance losses.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and Plan E funds, under Grants TIN-2009-14475-C04-01 and TIN2012-38341-C04-01.Valero Bresó, A.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2013). Exploiting Reuse Information to Reduce Refresh Energy in On-Chip eDRAM Caches. ACM. https://doi.org/10.1145/2464996.2467278

    Efficient register renaming and recovery for high-performance processors

    Full text link
    © © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memories (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-ofthe- art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM CAM scheme does not exceed the area required by conventional renaming mechanisms.This work was supported in part by the Spanish MINECO under Grant TIN2012-38341-C04-01.Petit Martí, SV.; Ubal Tena, R.; Sahuquillo Borrás, J.; López Rodríguez, PJ. (2014). Efficient register renaming and recovery for high-performance processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 22(7):1506-1514. https://doi.org/10.1109/TVLSI.2013.2270001S1506151422

    Cache-Hierarchy contention-aware scheduling in CMPs

    Full text link
    © © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other worksTo improve chip multiprocessor (CMP) performance, recent research has focused on scheduling strategies to mitigate main memory bandwidth contention. Nowadays, commercial CMPs implement multilevel cache hierarchies that are shared by several multithreaded cores. In this microprocessor design, contention points may appear along the whole memory hierarchy. Moreover, this problem is expected to aggravate in future technologies, since the number of cores and hardware threads, and consequently the size of the shared caches increase with each microprocessor generation. This paper characterizes the impact on performance of the different contention points that appear along the memory subsystem. The analysis shows that some benchmarks are more sensitive to contention in higher levels of the memory hierarchy (e.g., shared L2) than to main memory contention. In this paper, we propose two generic scheduling strategies for CMPs. The first strategy takes into account the available bandwidth at each level of the cache hierarchy. The strategy selects the processes to be coscheduled and allocates them to cores to minimize contention effects. The second strategy also considers the performance degradation each process suffers due to contention-aware scheduling. Both proposals have been implemented and evaluated in a commercial single-threaded quad-core processor with a relatively small two-level cache hierarchy. The proposals reach, on average, a performance improvement by 5.38 and 6.64 percent when compared with the Linux scheduler, while this improvement is by 3.61 percent for an state-of-the-art memory contention-aware scheduler under the evaluated mixes.This work was supported by the Spanish MINECO under Grant TIN2012-38341-C04-01, and by the Universitat Politecnica de Valencia under Grant PAID-05-12 SP20120748.Feliu Pérez, J.; Petit Martí, SV.; Sahuquillo Borrás, J.; Duato Marín, JF. (2014). Cache-Hierarchy contention-aware scheduling in CMPs. IEEE Transactions on Parallel and Distributed Systems. 25(3):581-590. https://doi.org/10.1109/TPDS.2013.61S58159025

    Addressing fairness in SMT multicores with a progress-aware scheduler

    Full text link
    © 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Current SMT (simultaneous multithreading) processors co-schedule jobs on the same core, thus sharing core resources like L1 caches. In SMT multicores, threads also compete among themselves for uncore resources like the LLC (last level cache) and DRAM modules. Per process performance degradation over isolated execution mainly depends on process resource requirements and the resource contention induced by co-runners. Consequently, the running processes progress at different pace. If schedulers are not progress aware, the unpredictable execution time caused by unfairness can introduce undesirable behaviors on the system such as difficulties to keep priority-based scheduling. This work proposes a job scheduler for SMT multicores that provides fairness to the execution of multiprogrammed workloads. To this end, the scheduler estimates per-process standalone performance by periodically creating low-contention co-schedules. These estimates are used to compute the per process progress. Then, those processes with less progress are prioritized to enhance fairness. Experimental results on a Intel Xeon with six dual-threaded SMT cores show that the proposed scheduler reduces unfairness, on average, by 3× over Linux OS. Moreover, thanks to the tread to core allocation policy, the scheduler slightly improves throughput and turnaround time.This work was supported by the Spanish Ministerio de Econom´ıa y Competitividad (MINECO) and Plan E funds, under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program AwardFeliu Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2015). Addressing fairness in SMT multicores with a progress-aware scheduler. IEEE. https://doi.org/10.1109/IPDPS.2015.48

    Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores

    Full text link
    [EN] Nowadays, high performance multicore processors implement multithreading capabilities. The processes running concurrently on these processors are continuously competing for the shared resources, not only among cores, but also within the core. While resource sharing increases the resource utilization, the interference among processes accessing the shared resources can strongly affect the performance of individual processes and its predictability. In this scenario, process scheduling plays a key role to deal with performance and fairness. In this work we present a process scheduler for SMT multicores that simultaneously addresses both performance and fairness. This is a major design issue since scheduling for only one of the two targets tends to damage the other. To address performance, the scheduler tackles bandwidth contention at the L1 cache and main memory. To deal with fairness, the scheduler estimates the progress experienced by the processes, and gives priority to the processes with lower accumulated progress. Experimental results on an Intel Xeon E5645 featuring six dual-threaded SMT cores show that the proposed scheduler improves both performance and fairness over two state-of-the-art schedulers and the Linux OS scheduler. Compared to Linux, unfairness is reduced to a half while still improving performance by 5.6 percent.We thank the anonymous reviewers for their constructive and insightful feedback. This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2015-66972-C5-1-R and TIN2014-62246EXP, and by the Intel Early Career Faculty Honor Program Award.Feliu-Pérez, J.; Sahuquillo Borrás, J.; Petit Martí, SV.; Duato Marín, JF. (2017). Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores. IEEE Transactions on Computers. 66(5):905-911. https://doi.org/10.1109/TC.2016.2620977S90591166
    corecore